Post4
ManiShankerKamarapu
Amazon Review analysis
Author

Mani Shanker Kamarapu

Published

November 5, 2022

Introduction

In the last post, I acquired the review data from Amazon, pre-processed it, converted it into a corpus, and drew a word cloud. In this post, I tidy the data further and analyze it using visualizations.

Loading the libraries

Code
library(polite)
library(rvest)
Warning: package 'rvest' was built under R version 4.2.2
Code
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.2.2
Code
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
✔ purrr   0.3.5      
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks plotly::filter(), stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
Code
library(stringr)
library(quanteda)
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Code
library(tidyr)
library(RColorBrewer)
library(quanteda.textplots)
library(wordcloud)
library(wordcloud2)
library(devtools)
Loading required package: usethis
Code
library(quanteda.dictionaries)
library(quanteda.sentiment)

Attaching package: 'quanteda.sentiment'

The following object is masked from 'package:quanteda':

    data_dictionary_LSD2015
Code
knitr::opts_chunk$set(echo = TRUE)

Reading the data

Code
reviews <- read_csv("amazonreview.csv")
New names:
• `` -> `...1`
Rows: 46450 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): review_title, review_text, review_star, ASIN
dbl (2): ...1, page

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
reviews

Summary of the data

Code
summary(reviews)
      ...1       review_title       review_text        review_star       
 Min.   :    1   Length:46450       Length:46450       Length:46450      
 1st Qu.:11613   Class :character   Class :character   Class :character  
 Median :23226   Mode  :character   Mode  :character   Mode  :character  
 Mean   :23226                                                           
 3rd Qu.:34838                                                           
 Max.   :46450                                                           
      page           ASIN          
 Min.   :  1.0   Length:46450      
 1st Qu.: 97.0   Class :character  
 Median :194.0   Mode  :character  
 Mean   :195.2                     
 3rd Qu.:291.0                     
 Max.   :400.0                     
Code
str(reviews)
spc_tbl_ [46,450 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1        : num [1:46450] 1 2 3 4 5 6 7 8 9 10 ...
 $ review_title: chr [1:46450] "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    Refreshing and Edgy with Great Characters\r\n  \r\n" "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    Well plotted and paced; excellent, fresh fantasy tale\r\n  \r\n" "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    Compelling and Riveting!\r\n  \r\n" "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    Let the Games Begin\r\n  \r\n" ...
 $ review_text : chr [1:46450] "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    I love fantasy; ever since I was a kid, stories set in creative w"| __truncated__ "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    First off, I'm a heavy duty fan of GRRM. I've read over a 100 dif"| __truncated__ "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    I am writing this review as an individual who watched Game of Thr"| __truncated__ "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    If you're going to consider reading `The Song of Ice and Fire' se"| __truncated__ ...
 $ review_star : chr [1:46450] "5.0 out of 5 stars" "5.0 out of 5 stars" "5.0 out of 5 stars" "5.0 out of 5 stars" ...
 $ page        : num [1:46450] 1 1 1 1 1 1 1 1 1 1 ...
 $ ASIN        : chr [1:46450] "B0001DBI1Q" "B0001DBI1Q" "B0001DBI1Q" "B0001DBI1Q" ...
 - attr(*, "spec")=
  .. cols(
  ..   ...1 = col_double(),
  ..   review_title = col_character(),
  ..   review_text = col_character(),
  ..   review_star = col_character(),
  ..   page = col_double(),
  ..   ASIN = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
glimpse(reviews)
Rows: 46,450
Columns: 6
$ ...1         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ review_title <chr> "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    Refreshing…
$ review_text  <chr> "\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  \r\n  \r\n    I love fan…
$ review_star  <chr> "5.0 out of 5 stars", "5.0 out of 5 stars", "5.0 out of 5…
$ page         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ ASIN         <chr> "B0001DBI1Q", "B0001DBI1Q", "B0001DBI1Q", "B0001DBI1Q", "…

Pre-processing function

Code
clean_text <- function (text) {
  # Remove URLs (the greedy pattern also drops the rest of the line after a URL)
  str_remove_all(text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>% 
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>% 
    # Replace "&" character reference with "and"
    str_replace_all("&amp;", "and") %>%
    # Remove punctuation
    str_remove_all("[[:punct:]]") %>%
    # Remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline or carriage-return characters with a space
    str_replace_all("[\r\n]", " ") %>%
    # Remove strings like "<U+0001F9F5>"
    str_remove_all("<.*?>") %>% 
    # Make everything lowercase
    str_to_lower() %>%
    # Trim leading/trailing whitespace and collapse repeated spaces
    str_squish()
}
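To make the effect of these steps concrete, here is a minimal sketch that applies the same sequence (minus the URL, mention and tag removal) to a made-up string. The sample text is hypothetical, and the base pipe `|>` is used only to keep the snippet free of extra imports.

```r
library(stringr)

# Hypothetical sample string, not taken from the review data
sample_review <- "\n\r\n   I LOVED it &amp; read all 5 books!!  \r\n"

cleaned <- sample_review |>
  str_replace_all("&amp;", "and") |>   # "&amp;" becomes "and"
  str_remove_all("[[:punct:]]") |>     # drops "!!"
  str_remove_all("[[:digit:]]") |>     # drops "5"
  str_replace_all("[\r\n]", " ") |>    # line breaks become spaces
  str_to_lower() |>
  str_squish()                         # trims and collapses whitespace

cleaned
# "i loved it and read all books"
```

Note that removing digits also removes ratings or book numbers mentioned inside the review text, which may or may not matter for the later analysis.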

Tidying the data

Code
reviews$clean_text <- clean_text(reviews$review_text) 
reviews <- reviews %>%
  drop_na(clean_text)
reviews

Removing unnecessary columns

Code
reviews <- reviews %>%
  select(-c(...1, page, review_text))
reviews

Pre-processing the title variable

Code
# str_squish() strips the "\r", "\n" and padding spaces left over from scraping
reviews$review_title <- reviews$review_title %>%
  str_squish()
reviews

Converting star of reviews from character to numeric

Code
reviews$review_star <- substr(reviews$review_star, 1, 3) %>%
  as.numeric()
reviews
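As a quick check of the conversion, the sketch below runs the same `substr()` and `as.numeric()` steps on a few made-up rating strings in the format shown by `str(reviews)`. The `stars_raw` vector is hypothetical.

```r
# Hypothetical rating strings in the same format as review_star
stars_raw <- c("5.0 out of 5 stars", "4.0 out of 5 stars", "1.0 out of 5 stars")

# The first three characters hold the rating, e.g. "5.0"
stars_num <- as.numeric(substr(stars_raw, 1, 3))
stars_num
# [1] 5 4 1
```

This relies on the rating always occupying exactly the first three characters, which holds for the values shown above but is an assumption about the rest of the column.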

Frequency of stars

Code
reviews %>%
  group_by(review_star) %>%
  count()
Code
p <- reviews %>%
  group_by(review_star) %>%
  ggplot(aes(review_star)) +
  geom_bar() +
  ggtitle("Frequency per star")
ggplotly(p)
Warning: The following aesthetics were dropped during statistical transformation:
x_plotlyDomain
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Unique set of ASIN numbers

Code
reviews %>%
  select(ASIN) %>%
  unique()

Adding new variable book title to the reviews

Code
reviews <- reviews %>%
  mutate(book_title = case_when(ASIN == "B0001DBI1Q" ~ "A Game of Thrones: A Song of Ice and Fire, Book 1", 
                                ASIN == "B0001MC01Y" ~ "A Clash of Kings: A Song of Ice and Fire, Book 2", 
                                ASIN == "B00026WUZU" ~ "A Storm of Swords: A Song of Ice and Fire, Book 3", 
                                ASIN == "B07ZN4WM13" ~ "A Feast for Crows: A Song of Ice and Fire, Book 4", 
                                ASIN == "B005C7QVUE" ~ "A Dance with Dragons: A Song of Ice and Fire, Book 5", 
                                ASIN == "B000BO2D64" ~ "Twilight: The Twilight Saga, Book 1", 
                                ASIN == "B000I2JFQU" ~ "New Moon: The Twilight Saga, Book 2", 
                                ASIN == "B000UW50LW" ~ "Eclipse: The Twilight Saga, Book 3", 
                                ASIN == "B001FD6RLM" ~ "Breaking Dawn: The Twilight Saga, Book 4", 
                                ASIN == "B07HHJ7669" ~ "The Hunger Games", 
                                ASIN == "B07T6BQV2L" ~ "Catching Fire: The Hunger Games", 
                                ASIN == "B07T43YYRY" ~ "Mockingjay: The Hunger Games, Book 3"))
reviews
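A long `case_when()` works, but the same mapping could also be kept in a small lookup table and joined on `ASIN`. The sketch below is one possible alternative, not the approach used in this post; `book_lookup` and `toy_reviews` are hypothetical names, and only two of the ASIN/title pairs are copied over for brevity.

```r
library(dplyr)

# Lookup table: two ASIN/title pairs copied from the case_when() above
book_lookup <- data.frame(
  ASIN = c("B0001DBI1Q", "B000BO2D64"),
  book_title = c("A Game of Thrones: A Song of Ice and Fire, Book 1",
                 "Twilight: The Twilight Saga, Book 1")
)

# Hypothetical stand-in for the reviews data
toy_reviews <- data.frame(
  ASIN = c("B000BO2D64", "B0001DBI1Q", "B000BO2D64"),
  review_star = c(5, 4, 2)
)

# Every review picks up the title that matches its ASIN
toy_labelled <- left_join(toy_reviews, book_lookup, by = "ASIN")
toy_labelled
```

With a lookup table, a series column could be added as one more column in the same table instead of a second `case_when()`.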

Frequency of each book title

Code
reviews %>%
  group_by(book_title) %>%
  count()

Frequency of stars for each book title

Code
reviews %>%
  group_by(book_title) %>%
  summarise(star = sum(review_star))
Code
reviews %>% 
  group_by(book_title) %>%
  summarise(star = sum(review_star)) %>%
  ggplot(aes(book_title, star)) +
  # There is one summed value per book, so draw bars rather than boxplots
  geom_col() +
  theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1)) +
  ggtitle("No of stars per book")
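Because the sum of stars grows with the number of reviews, a book with many reviews can show a larger total than a better-rated book with fewer reviews. The sketch below uses made-up numbers (the `toy` data frame and its values are hypothetical) to put the total next to the review count and the mean rating.

```r
library(dplyr)

# Hypothetical data: Book A has more reviews, Book B has higher ratings
toy <- data.frame(
  book_title = c("Book A", "Book A", "Book A", "Book A", "Book B", "Book B"),
  review_star = c(3, 3, 3, 3, 5, 5)
)

star_summary <- toy %>%
  group_by(book_title) %>%
  summarise(
    n_reviews = n(),
    total_stars = sum(review_star),
    mean_star = mean(review_star)
  )
star_summary
# Book A: 4 reviews, 12 total stars, mean 3
# Book B: 2 reviews, 10 total stars, mean 5
```

Here Book A has the larger total even though Book B is rated higher on average, so the total is best read together with the review counts.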

Frequency of each type of star per book title

Code
reviews %>%
  group_by(book_title,review_star) %>%
  count()
Code
p <- reviews %>%
  group_by(book_title,review_star) %>%
  count() %>%
  ggplot(aes(review_star, n, color = book_title)) +
  geom_line() +
  ggtitle("Book vs Freq of each star") +
  xlab("Type of star") +
  ylab("Frequency")
ggplotly(p)

The combined plot is hard to read, so let’s plot each book in its own panel.

Code
p <- reviews %>%
  group_by(book_title,review_star) %>%
  count() %>%
  ggplot(aes(review_star, n)) +
  geom_line() +
  facet_wrap(vars(book_title), ncol = 2) +
  ggtitle("Book vs Freq of each star") +
  xlab("Type of star") +
  ylab("Frequency")
ggplotly(p)

Adding new variable series title to the reviews

Code
reviews <- reviews %>%
  mutate(series_title = case_when(ASIN == "B0001DBI1Q" ~ "A Song of Ice and Fire", 
                                  ASIN == "B0001MC01Y" ~ "A Song of Ice and Fire", 
                                  ASIN == "B00026WUZU" ~ "A Song of Ice and Fire", 
                                  ASIN == "B07ZN4WM13" ~ "A Song of Ice and Fire", 
                                  ASIN == "B005C7QVUE" ~ "A Song of Ice and Fire", 
                                  ASIN == "B000BO2D64" ~ "The Twilight Saga", 
                                  ASIN == "B000I2JFQU" ~ "The Twilight Saga", 
                                  ASIN == "B000UW50LW" ~ "The Twilight Saga", 
                                  ASIN == "B001FD6RLM" ~ "The Twilight Saga", 
                                  ASIN == "B07HHJ7669" ~ "The Hunger Games", 
                                  ASIN == "B07T6BQV2L" ~ "The Hunger Games", 
                                  ASIN == "B07T43YYRY" ~ "The Hunger Games"))
reviews

Frequency of each series title

Code
reviews %>%
  group_by(series_title) %>%
  count()

Frequency of stars by series

Code
reviews %>%
  group_by(series_title) %>%
  summarise(star = sum(review_star))
Code
reviews %>% 
  group_by(series_title) %>%
  summarise(star = sum(review_star)) %>%
  ggplot(aes(series_title, star)) +
  # There is one summed value per series, so draw bars rather than boxplots
  geom_col() +
  theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1)) +
  ggtitle("No of stars per series")

Frequency of each type of star per series title

Code
reviews %>%
  group_by(series_title,review_star) %>%
  count()
Code
p <- reviews %>%
  group_by(series_title,review_star) %>%
  count() %>%
  ggplot(aes(review_star, n, color = series_title)) +
  geom_line() +
  ggtitle("Series vs Freq of each star") +
  xlab("Type of star") +
  ylab("Frequency")
ggplotly(p)

Tokenization of data

Code
# Converting the text into a corpus
text_corpus <- corpus(reviews$clean_text) 
# Converting the text into tokens
text_token <- tokens(text_corpus, remove_punct=TRUE, remove_numbers = TRUE) %>% 
  tokens_select(pattern=stopwords("en"), 
                selection="remove")
text_token
Tokens consisting of 46,447 documents.
text1 :
 [1] "love"      "fantasy"   "ever"      "since"     "kid"       "stories"  
 [7] "set"       "creative"  "worlds"    "featuring" "varied"    "groups"   
[ ... and 985 more ]

text2 :
 [1] "first"     "im"        "heavy"     "duty"      "fan"       "grrm"     
 [7] "ive"       "read"      "different" "fantasy"   "authors"   "time"     
[ ... and 484 more ]

text3 :
 [1] "writing"    "review"     "individual" "watched"    "game"      
 [6] "thrones"    "hbo"        "picking"    "book"       "series"    
[11] "mind"       "may"       
[ ... and 679 more ]

text4 :
 [1] "youre"      "going"      "consider"   "reading"    "`"         
 [6] "song"       "ice"        "fire"       "series"     "prepared"  
[11] "investment" "books"     
[ ... and 790 more ]

text5 :
 [1] "thorough"    "review"      "asoiaf"      "amazon"      "outlines"   
 [6] "many"        "significant" "points"      "thought"     "id"         
[11] "put"         "review"     
[ ... and 697 more ]

text6 :
 [1] "long"      "journey"   "end"       "sight"     "ill"       "try"      
 [7] "avoid"     "much"      "plot"      "beyond"    "needed"    "mentioned"
[ ... and 791 more ]

[ reached max_ndoc ... 46,441 more documents ]
Code
# Converting tokens into Document feature matrix
text_dfm <- dfm(text_token)
text_dfm
Document-feature matrix of: 46,447 documents, 74,610 features (99.95% sparse) and 0 docvars.
       features
docs    love fantasy ever since kid stories set creative worlds featuring
  text1    1      12    2     1   1       3   1        1      2         1
  text2    2       8    0     0   0       1   1        0      0         0
  text3    1       1    0     0   0       0   0        0      0         0
  text4    0       7    0     1   0       1   0        0      0         0
  text5    0      12    1     1   0       0   0        0      0         0
  text6    0       0    0     0   0       1   0        0      0         0
[ reached max_ndoc ... 46,441 more documents, reached max_nfeat ... 74,600 more features ]
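The corpus, tokens and document-feature matrix steps are easier to follow on a tiny input. The sketch below runs the same quanteda calls on two made-up sentences; `toy_text` and `toy_dfm` are hypothetical names, and the base pipe `|>` is used so the snippet needs nothing beyond quanteda.

```r
library(quanteda)

# Two made-up documents standing in for reviews$clean_text
toy_text <- c(doc1 = "i love this fantasy series",
              doc2 = "the fantasy plot is slow")

toy_dfm <- corpus(toy_text) |>
  tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_select(pattern = stopwords("en"), selection = "remove") |>
  dfm()

# One row per document, one column per remaining word
toy_dfm
featnames(toy_dfm)
```

Stopwords such as "the" are dropped before counting, while "fantasy" is kept and counted once in each document.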
Code
# Total no of tokens
sum(ntoken(text_token))
[1] 2259952
Code
# Summary of the corpus
summary(text_corpus)
Code
# Finding the frequency of each word
word_counts <- as.data.frame(sort(colSums(text_dfm), decreasing = TRUE))
colnames(word_counts) <- c("Frequency")
word_counts$word <- row.names(word_counts)
word_counts$Rank <- seq_len(nrow(word_counts))
word_counts 
Code
word_counts %>%
  head(40) %>%
  mutate(word = reorder(word, Frequency)) %>%
  ggplot(aes(word, Frequency)) +
  geom_bar(stat = "identity") +
  ylab("Occurrences") +
  coord_flip()

Code
# Trimming the dfm to features that occur at least 50 times overall
text_df <- dfm_trim(text_dfm, min_termfreq = 50)
# create fcm from dfm
text_fcm <- fcm(text_df)
text_fcm
Feature co-occurrence matrix of: 3,795 by 3,795 features.
          features
features    love fantasy ever since kid stories  set creative worlds groups
  love     13025    2234 3670  2395 337    1701 1155      224    161     72
  fantasy      0    1779  884   584  45     554  456       76     99     20
  ever         0       0  722   758  90     408  347       64     55     17
  since        0       0    0   455  60     271  309       36     28     15
  kid          0       0    0     0  48      23   27        6      5      5
  stories      0       0    0     0   0     313  190       28     33     33
  set          0       0    0     0   0       0  237       26     29     11
  creative     0       0    0     0   0       0    0       22      3      2
  worlds       0       0    0     0   0       0    0        0      8      4
  groups       0       0    0     0   0       0    0        0      0      2
[ reached max_feat ... 3,785 more features, reached max_nfeat ... 3,785 more features ]

Network plot

Code
# pull the top features
top_features <- names(topfeatures(text_fcm, 50))
# retain only those top features as part of our matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
# compute size weight for vertices in network
size <- log(colSums(even_text_fcm))
# create plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)
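The vertex sizes are the log of each feature's column total, rescaled so the largest vertex has size 2. A small worked example with made-up totals (the `totals` vector is hypothetical) shows how the log compresses the range:

```r
# Hypothetical co-occurrence totals for three features
totals <- c(love = 10, book = 100, series = 1000)

size <- log(totals)
vertex_size <- size / max(size) * 2
round(vertex_size, 2)
# love 0.67, book 1.33, series 2.00
```

A 100-fold difference in totals becomes only a 3-fold difference in vertex size, which keeps the most frequent words from dominating the plot.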

Wordcloud

Code
textplot_wordcloud(text_dfm, min_size = 1.5, max_size = 4, random_order = TRUE, max_words = 150, min_count = 50, color = brewer.pal(8, "Dark2") )

Further study

Next, I will try sentiment analysis using multiple lexicons and compare which one is most suitable for this data.